{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f00c80aa",
   "metadata": {},
   "source": [
    "Install (or update) the following libraries on your SageMaker\n",
    "instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0df41947",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install .. # installing d2l\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77479af3",
   "metadata": {
    "origin_pos": 0
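   },
   "source": [
    "# Asynchronous Computation\n",
    ":label:`sec_async`\n",
    "\n",
    "Today's computers are highly parallel systems, consisting of multiple CPU cores, multiple GPUs, and multiple processing units. Typically each CPU core runs multiple threads, each machine often has multiple GPUs, and each GPU has multiple processing units. In short, we can process many different things at the same time, often on different devices. Unfortunately, Python is not a great way of writing parallel and asynchronous code, at least not without some extra help. After all, Python is effectively single-threaded, and this is unlikely to change in the future. Deep learning frameworks such as MXNet and TensorFlow therefore adopt an *asynchronous programming* model to improve performance, while PyTorch uses Python's own scheduler, leading to a different performance trade-off. For PyTorch, GPU operations are asynchronous by default. When you call a function that uses the GPU, the operations are enqueued to the particular device but are not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.\n",
    "\n",
    "Hence, understanding how asynchronous programming works helps us to develop more efficient programs by proactively reducing computational requirements and mutual dependencies. This allows us to reduce memory overhead and increase processor utilization.\n"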
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "66ebecda",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:29:23.676819Z",
     "iopub.status.busy": "2023-08-18T07:29:23.676275Z",
     "iopub.status.idle": "2023-08-18T07:29:26.719058Z",
     "shell.execute_reply": "2023-08-18T07:29:26.717749Z"
    },
    "origin_pos": 2,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "import numpy\n",
    "import torch\n",
    "from torch import nn\n",
    "from d2l import torch as d2l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86ca52da",
   "metadata": {
    "origin_pos": 4
   },
   "source": [
    "## Asynchrony via Backend\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fbdcd2b",
   "metadata": {
    "origin_pos": 6,
    "tab": [
     "pytorch"
    ]
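   },
   "source": [
    "As a warmup, consider the following toy problem: we want to generate a random matrix and multiply it by itself. Let's do that both in NumPy and with PyTorch tensors to see the difference. Note that the PyTorch `tensor` is defined on a GPU (`d2l.try_gpu()` falls back to the CPU if no GPU is available).\n"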
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e4c20b11",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:29:26.723694Z",
     "iopub.status.busy": "2023-08-18T07:29:26.723007Z",
     "iopub.status.idle": "2023-08-18T07:29:29.882717Z",
     "shell.execute_reply": "2023-08-18T07:29:29.881143Z"
    },
    "origin_pos": 9,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "numpy: 1.0704 sec\n",
      "torch: 0.0013 sec\n"
     ]
    }
   ],
   "source": [
    "# Warm up the GPU before benchmarking\n",
    "device = d2l.try_gpu()\n",
    "a = torch.randn(size=(1000, 1000), device=device)\n",
    "b = torch.mm(a, a)\n",
    "\n",
    "with d2l.Benchmark('numpy'):\n",
    "    for _ in range(10):\n",
    "        a = numpy.random.normal(size=(1000, 1000))\n",
    "        b = numpy.dot(a, a)\n",
    "\n",
    "with d2l.Benchmark('torch'):\n",
    "    for _ in range(10):\n",
    "        a = torch.randn(size=(1000, 1000), device=device)\n",
    "        b = torch.mm(a, a)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8188fde",
   "metadata": {
    "origin_pos": 12,
    "tab": [
     "pytorch"
    ]
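   },
   "source": [
    "The benchmark output via PyTorch is orders of magnitude faster. The NumPy dot product is executed on the CPU while the PyTorch matrix multiplication is executed on the GPU, so the latter is expected to be much faster. But the huge time difference suggests that something else must be going on. By default, GPU operations are asynchronous in PyTorch. Forcing PyTorch to finish all computation before returning shows what happened previously: the computation is carried out by the backend while the frontend returns control to Python.\n"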
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "78106858",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:29:29.891458Z",
     "iopub.status.busy": "2023-08-18T07:29:29.890289Z",
     "iopub.status.idle": "2023-08-18T07:29:29.904366Z",
     "shell.execute_reply": "2023-08-18T07:29:29.902435Z"
    },
    "origin_pos": 15,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done: 0.0049 sec\n"
     ]
    }
   ],
   "source": [
    "with d2l.Benchmark():\n",
    "    for _ in range(10):\n",
    "        a = torch.randn(size=(1000, 1000), device=device)\n",
    "        b = torch.mm(a, a)\n",
    "    torch.cuda.synchronize(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb45905d",
   "metadata": {
    "origin_pos": 18,
    "tab": [
     "pytorch"
    ]
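   },
   "source": [
    "Broadly speaking, PyTorch has a frontend for direct interaction with users, e.g., via Python, as well as a backend used by the system to perform the computation. As shown in :numref:`fig_frontends`, users can write PyTorch programs in various frontend languages, such as Python and C++. Regardless of the frontend programming language used, the execution of PyTorch programs occurs primarily in the backend, which is implemented in C++. Operations issued by the frontend language are passed on to the backend for execution. The backend manages its own threads, which continuously collect and execute queued tasks. Note that for this to work, the backend must be able to keep track of the dependencies between the various steps in the computational graph. Hence, it is not possible to parallelize operations that depend on each other.\n"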
   ]
  },
  {
   "cell_type": "markdown",
   "id": "751ec224",
   "metadata": {
    "origin_pos": 20
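   },
   "source": [
    "![Programming language frontends and deep learning framework backends](../img/frontends.png)\n",
    ":width:`300px`\n",
    ":label:`fig_frontends`\n",
    "\n",
    "Let's look at another toy example to understand the dependency graph a bit better.\n"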
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e4b981d5",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:29:29.910704Z",
     "iopub.status.busy": "2023-08-18T07:29:29.910033Z",
     "iopub.status.idle": "2023-08-18T07:29:29.963733Z",
     "shell.execute_reply": "2023-08-18T07:29:29.962149Z"
    },
    "origin_pos": 22,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[3., 3.]], device='cuda:0')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = torch.ones((1, 2), device=device)\n",
    "y = torch.ones((1, 2), device=device)\n",
    "z = x * y + 2\n",
    "z"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47f52925",
   "metadata": {
    "origin_pos": 24
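   },
   "source": [
    "![The backend tracks dependencies between the various steps in the computational graph](../img/asyncgraph.svg)\n",
    ":label:`fig_asyncgraph`\n",
    "\n",
    "The code snippet above is also illustrated in :numref:`fig_asyncgraph`. Whenever the Python frontend thread executes one of the first three statements, it simply hands the task to the backend queue and returns. When the result of the last statement needs to be printed, the Python frontend thread waits for the C++ backend thread to finish computing the result of the variable `z`. One benefit of this design is that the Python frontend thread does not need to perform the actual computations. Thus, there is little impact on the program's overall performance, regardless of Python's performance. :numref:`fig_threading` illustrates how the frontend and backend interact.\n",
    "\n",
    "![Interactions of the frontend and backend](../img/threading.svg)\n",
    ":label:`fig_threading`\n",
    "\n",
    "## Barriers and Blockers\n"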
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c107794",
   "metadata": {
    "origin_pos": 29
   },
   "source": [
    "## Improving Computation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "185ccf3b",
   "metadata": {
    "origin_pos": 32
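   },
   "source": [
    "A simplified interaction between the Python frontend thread and the C++ backend thread can be summarized as follows:\n",
    "\n",
    "1. The frontend orders the backend to insert the computation task `y = x + 1` into the queue.\n",
    "1. The backend then receives the computation task from the queue and performs the actual computation.\n",
    "1. The backend then returns the computation result to the frontend.\n",
    "\n",
    "Assume that the durations of these three stages are $t_1, t_2$, and $t_3$, respectively. Without asynchronous programming, the total time taken to perform 10000 computations is approximately $10000 (t_1+ t_2 + t_3)$. With asynchronous programming, the total time taken to perform $10000$ computations can be reduced to $t_1 + 10000 t_2 + t_3$ (assuming $10000 t_2 > 9999t_1$), since the frontend does not have to wait for the backend to return the computation result for each loop.\n",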
    "\n",
    "\n",
    "## 小结\n",
    "\n",
    "* 深度学习框架可以将Python前端的控制与后端的执行解耦，使得命令可以快速地异步插入后端、并行执行。\n",
    "* 异步产生了一个相当灵活的前端，但请注意：过度填充任务队列可能会导致内存消耗过多。建议对每个小批量进行同步，以保持前端和后端大致同步。\n",
    "* 芯片供应商提供了复杂的性能分析工具，以获得对深度学习效率更精确的洞察。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a353540",
   "metadata": {
    "origin_pos": 34
   },
   "source": [
    "## Exercises\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1989c931",
   "metadata": {
    "origin_pos": 36,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "1. On the CPU, benchmark the same matrix multiplication operations used in this section. Can you still observe asynchrony via the backend?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85daf71a",
   "metadata": {
    "origin_pos": 39,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "[Discussions](https://discuss.d2l.ai/t/2791)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p36",
   "name": "conda_pytorch_p36"
  },
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}